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Abstract 

The chief aim of this paper is to propose mean-field approximations for a broad class 
of Belief networks, of which sigmoid and noisy-or networks can be seen as special cases. 
The approximations are based on a powerful mean-field theory suggested by Plefka. We 
show that Saul, Jaakkola, and Jordan's approach is the first order approximation in Ple- 
fka 's approach, via a variational derivation. The application of Plefka's theory to belief 
networks is not computationally tractable. To tackle this problem we propose new approx- 
imations based on Taylor series. Small scale experiments show that the proposed schemes 
are attractive. 

1. Introduction 

Bayesian belief networks (Pearl, 1988; Lauritzen & Spiegelhalter, 1988) are powerful graph- 
ical representations of probabilistic models. These networks are directed acyclic graphs 
(DAGs) whose nodes represent random variables while the links represent causal influences. 
This association between graph theory and probability serves as an elegant tool for handling 
uncertainty in real life. These networks are increasingly being used in diverse areas from 
image processing (Agosta, 1990) to medical diagnosis (Shwe & Others, 1991). 

The usefulness of these networks relies heavily on solving the problem of inference. A 
large number of algorithms have been developed to efficiently compute the likelihood ex- 
actly, examples include pruning based methods (KjaerulfF, 1998), or the bounded condition- 
ing method (Horvitz, Suermondt, & Cooper, 1989) etc. However these algorithms are slow 
(Jensen, Kong, & KjaerulfF, 1995) when applied to densely connected belief networks(BNs). 
In large networks exact algorithms are intractable, as they require summing over an ex- 
ponentially large number of hidden states. A possible approach to tackle this problem is 
to resort to Markov Chain Monte Carlo based methods like Gibbs Sampling (Geman & 
Geman, 1984; Neal, 1992). This approach yields accurate results but is extremely slow in 
convergence. 

©2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved. 



Chiru & Keerthi 



In this paper we will focus on the mean-field theory borrowed from statistical mechan- 
ics. The intuition behind these methods is that a node is relatively insensitive to particular 
settings of the values of its neighbors, but rather depends on their mean-values. This ob- 
servation leads to a simple and often very accurate procedures; (see Jordan, Ghahramani, 
Jaakkola, & Saul, 1997). This technique yields an estimate of the means of the uninstanti- 
ated nodes and also an estimate of the partition function. The main aim of this paper is to 
study a powerful mean-field technique due to Plefka (1982), and apply it to BNs. 

Plefka initially proposed his theory in the context of spin glasses, hence it's application 
to Boltzmann Machines (BMs) (Ackley, Hinton, & Sejnowski, 1985) is straightforward. To 
develop mean-field theories for BNs on the lines of BMs, it is important that we set up a 
framework in which both these networks can be studied. Both BMs and BNs can be seen as 
different realizations of "Graphical models" (GMs). In Section 2 we review the definitions of 
BMs and BNs as special cases of GMs and formulate the associated probability distribution 
as Boltzmann Gibbs distribution. 

Our main results are in Section 3. In this section Plefka's approach is derived from a 
variational perspective. Let Z denote the partition function associated with the Boltzmann 
distribution. The variational perspective corresponds to introducing extra variables and 
deriving a convex function of these variables as an upper bound on the negative logarithm 
of Z . Tightening of this bound leads to a minimization problem for which stationarity con- 
ditions are both necessary and sufficient for global minimum. At the stationary point the 
bound is attained. The convex function mentioned above is computationally intractable. 
It is approximated and the stationarity conditions of this approximate function yields the 
mean-field equations. Plefka's approach provides a systematic way of tractably approxi- 
mating this function to desired orders of accuracy through Taylor series. We show that 
the approximation obtained by Saul et al.'s approach is the same as that of the first order 
approximation in Plefka's approach. The extension of Plefka's approach is not direct. We 
propose a new activation-independent scheme based on Taylor series methods to achieve 
this. In Section 4 experiments on small networks were conducted. These experiments show 
that the obtained approximations are quite attractive. 

2. Review of BMs and BNs 

In this section we establish a framework in which both BMs and BNs can be studied. 
Indeed both BMs and BNs can be seen as different types of GMs, the difference being in 
the connectivity and the energy function used in computing the probability. 

In this paper we will restrict ourselves to binary valued units. We thus define a GM 
to be consisting of a fixed number of stochastic units. Each unit has an associated binary 
random variable S G {0, 1} (or {1, —1}). Each state S = {Si, 5*2, • • • , 5jv} of a GM has an 
associated "energy function" , which is used to define the Boltzmann Distribution: 



The denominator Z, called the partition function, is the normalization factor defined by 



e 



E(S)/T 



P{S) 



z 




Z = J2 s e 



E(S)/T 
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GMs have units called "hidden", and "visible" units. The state vector S can be thought 
of as being divided into the pair of H (hidden units) and V (visible units), or S = {H, V}. 
During operation phase, visible units are clamped, that is V = v, or S = {H,v}. The 
distribution associated with the hidden units for the clamped phase is 

e -E(S)/T 

"W* = "W < 2 > 

where Z c (v) = J2f} y-ff e ~ E ^ S ^ T is used for "inferring" the values of hidden units. 
When E is substituted with 

N N N 

E(S) = -0.5 J2 E w iASj - E ( 3 ) 

i=l j = l i=l 

and also the restriction Wij = Wji, is imposed on the weights the well known BM is obtained. 

In this paper we will restrict ourselves to a special class of BNs with the following energy 
function 

N 

E(S) = - £{S,-ln/(M,-) + (1 - Si) Ml - f(Mi))} (4) 

8 = 1 

where Mi = w ijSj + hi, and / is a function from a subset C of the real line denoted 

by 5ft, to the interval (0, 1), i.e., 

/ : C C (0,1). 

We will assume that / is an analytic function. There is also the restriction on weights 
Wij = 0, if i < j. Sigmoid and noisy-or networks are two such special networks (Neal, 
1992), with activation functions o(x) = and p(x) = 1 — e~ x respectively. In sigmoid 

networks x can be any real number, while in noisy-or x is restricted to be non-negative; this 
constraint is enforced by forcing the weights and biases to be non-negative. From here on 
whenever we refer to BNs it would be with reference to (4). 

For inference the crucial things to be evaluated are Z c and Z. The first item requires 
summing 2 H terms while the next term requires summing 2 N terms (where H is the number 
of hidden units and N is the total number of units). The expressions for partition functions 
are similar in nature both in the clamped and the undamped phase. The only difference 
is that in the clamped phase the summation is over hidden units, while in the undamped 
phase it is over all the units. Thus throughout the remaining paper we will talk only about 
evaluating Z. 

As mentioned above computing Z requires an exponential summation operation. Thus 
exact calculation of Z is intractable. Sampling based methods offer a possible way out. 
But these methods are computationally very expensive. Another alternative is to look 
for deterministic approaches, based on mean-field theory as advocated by (Peterson & 
Anderson, 1987; Saul et al., 1996; Kappen & Rodriguez, 1998). The methods proposed by 
(Peterson & Anderson, 1987; Kappen & Rodriguez, 1998) are applicable to BMs, while Saul 
et al.(1996) developed a scheme which has been applied to both BNs and BMs. The next 
section will be devoted to Plefka's method. We intend to study it, understand its relations 
with existing theories and apply it to BNs. 
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3. Plefka's Method and BNs 

Recently Kappen and Rodriguez(1998) introduced, to the neural network community, a 
powerful approximation method due to Plefka. In this section we will present Plefka's 
method in a different way, so that it can be extended to BNs. We will also demonstrate, 
later in this section that Plefka's method is more general than SJJ approach. 



3.1 Plefka's Method 

Define a new energy function 

N 



E{9, 1 ) = 1 E/T-Y,9iSi (5) 



8 = 1 



(Since the dependence of various functions on S, T and w is quite obvious, in what follows 
we will not mention these as functional arguments. We do this so that the dependence 
on other variables such as j and 9 stands out more clearly.) The corresponding partition 
function and probability distribution are: 

Z=J2e~ S (6) 



Pi = -j- (7) 

We treat 9{ as external variables. This helps us in extending the approach to derive mean- 
field theories for BNs. Our approach differs from that of (Kappen & Rodriguez, 1998), 
which identifies 9{ with hi, the bias variables in BM, see equation (3). 

The motivation for bringing in j and 9 and defining the above functions is as follows. 
If the energy function is of the linear form, J2i@iSii then the corresponding probability 
distribution is factorial, (of course one can use other functions and still the probability 
distribution can be factorial), and computations can be done tractably. The parameter j 
can be thought of as a homotopy parameter that smoothly brings in the true energy function 
into picture as j is increased from 0. To get the original energy function from E we need 
to set 

0i = OVie{l,2,---iV} , 7 = 1 (8) 

For convenient manipulation, it is useful to define the function, 

C{0,y) = -lnZ + J2^ t (9) 

i 

where _ 

». = <•%, = ^ do) 

Since u, the mean vector, is physically more meaningful than 9, it is appropriate to consider 
it as free and treat 9 as being dependent on u. To facilitate such a treatment the following 
assumption is made. 
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Invertibility Assumption . For each fixed u and 7, (10) can be uniquely solved for 
9. 

Let 9i = Ki(u,j) denote the inverse of (10). When the inverse function is used, C 
transforms to a function of u and 7: 

G{u, 1 ) = C{l 1 )\ s= ^ i) (11) 

Thus 

G{u,i) = -\nZ + Y d i u i (12) 

i 

where 9{ is considered to be a function of w,and 7. Note that 

§^ = E §p- = Et-^ + + E = * (13) 

ati; . CW, OU; ^ 06; , CW, OU; 

J J 3 k J 1 

The relationship between 6 and u, more precisely equations (10), (13), and the invert- 
ibility assumption, is well known and is called the Legendre Transformation (Rockafeller, 
1972). Central to this transformation is the Invertibility assumption. The validity of this 
assumption at 7 = 1 is important for the techniques discussed in the paper. 

In the above we have made use of the fact that ^ is the inverse of §5. Hence the 
Qi = Vi, requirement mentioned in (8) translates to 

? = fl, = 0Vi (14) 

OUi 

Clearly, when this constraint is enforced and 7 is carried to 1, we get G = —InZ. 

The inversion of (10) is intractable for 7^0, which makes it impossible to write 
an algebraic expression for k and hence G. To circumvent this problem, an approximate 
description of G is built by using the fact that the inversion of (10) is quite straightforward 
at 7 = 0, and expanding G(u,j) around 7 = using Taylor Series. 

At 7 = 0, 

e t s t 

po = Utt^ (15) 

i 

and equation (10) becomes 



' 1 + e°i 

The inversion of this equation can be carried out explicitly to obtain d{ = In 1 "' and 

G(u, 0) = In Ui + (1 - Ui) ln(l - u t )] 

i 

Also, the distribution in (15) can be expressed as a function of in factorial form 

pb=n«f i ( i -«.-) i_Si 
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The truncated Taylor series approximation of G is 

M k f)kr-< 

Gm(u, 1 ) = G(u,0) + J2 1 



-J kl dj k 



(17) 



7=0 



Gm is a function of u and 7, and can be viewed as an Mth order approximation of G. Let 
us now consider the requirements mentioned in (8). Setting 7 = 1 is straightforward to 
implement. The requirement, 9i = 0, for all i, can be approximately enforced using (14) as 



dG dG M 

du-j du-j 



Vi 



:i8) 



These equations are used to set up the fixed point equations. Henceforth we will refer them 
as the mean-field equations. 

The feasibility of the scheme lies in computing the partial derivatives of G with respect 
to 7 at 7 = 0. In this paper we will restrict ourselves to M = 2. The relevant expressions 
(Plefka, 1982; Bhattacharyya k Keerthi, 1999b) are 



dG 
dj 



7=0 



(E) 



Po 



d 2 G 



d 7 2 



7=0 



8 = 1 



U t (l - Ui) 



(19) 



(20) 



Expressions for other derivatives can also be derived in a straightforward way. For BM 
these derivatives are tractable. For BNs computing these terms is problematic, and so one 
is forced to make further approximations. New approximation schemes that we have devised 
will be discussed later. 



To evaluate the quantities required for learning one has to compute d J nZ ■ Let us consider 



the function 



f(w) = -G(u*,l) = lnZ 



where u* is no longer a free vector, but it is considered to be a function of w given by (14). 
Note that 

dlnZ df (dG(u*,l) . ^ dG dul 



dw 



dw 



dw; 



+ E 



k du\dw l3i 



Using (14) we get 



dlnZ 



dG(u*,l) 



w; 



d 



w; 



The right hand side is intractable, but can be approximated using Gm at 7 = 1. The 
derivatives required for learning can thus be computed by the approximation: 



dlnZ 



dG(u*, 1) 



dG M (u, 1) 



dwij dwij dwij 



where u is a solution of the mean-field equations, (18). 
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3.2 Saul, Jaakkola and Jordan (SJJ) Approach and Plefka's Method 

Saul et al.(1996) pioneered the application of mean-field theory to BNs. In this subsection 
we will show that their approach can be seen as a first order approximation of Plefka's 
method. To establish this we present a new variational derivation of Plefka's method. 

3.2.1 Saul, Jaakkola and Jordan (SJJ) Approach 

In this section we review Saul et al.'s work. They adopt a variational approach: form a 
function (with extra variables) which is a lower bound for In Z and choose the variables so as 
to maximize the function. An approximate distribution, q(S, u), is chosen with a parameter 
vector u. The parameter u, is tuned so that q(S,u) is as close as possible to p(S). As 
a measure of closeness between two distributions Kullback Leibler distance is used. It is 
defined by 

D( q ,p) = J2l(S,u)ln q -^ (21) 

The u is chosen so that D(q,p) is minimized. q(S, u) is chosen to be factorial, i.e. 

N 

q(S,v) = l[u? i (l-u i ) 1 - Si (22) 

8 = 1 

When p is the Boltzmann distribution the Kullback Leibler distance takes the form 

D(q,p) = \nZ + L s {u) (23) 

where 

1 N 

L s (u) = -(S(s)), (s) + ^(ti t -lnti t - + (l-ti t -)ln(l-ti t -)) (24) 

8 = 1 

Using the fact that D(q,p) > we get a lower bound on In Z: 

\nZ>-L s (25) 

L s is minimized with respect to u to get an upper bound on — In Z. The vector u is 
determined by solving the following set of equations. 

— - = 26 

OUi 

Though this approximation is well known in statistical physics literature (Parisi, 1988) 
as the naive mean-field theory, Saul et al.'s application of this theory to BNs is new and 
interesting, hence we will refer the above approach as the SJJ approach. A look at (24) 
shows that this approach has computational feasibility only when (E) q can be expressed as 
a function of u. For BNs, the calculation of (E(S)) q is intractable, even when q is factorial. 
Saul et al. suggest exploiting activation-dependent convexity properties to develop further 
approximations to L s . Thus one has to tailor the SJJ approach for various activation 
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functions. An approximation, L sapprox , is developed keeping two things in mind: (1) it is 
close enough to L s ; and (2) it is a tractable function of u. Then (26) is approximated by 

FIT 

v^sapprox _ q 

Even when u is chosen to minimize L s , the equality in (25) is attained if and only if p is 
factorial. This follows from the fact that D(q,p) = if and only if p = q, and the fact that q 
is chosen to be factorial. Thus SJJ approach is inadequate for obtaining an arbitrarily close 
approximation to In Z, for a general distribution, p. One way of overcoming this drawback, 
is to treat a small number of variables exactly and approximate the rest; see (Jaakkola & 
Jordan, 1999; Saul & Jordan, 1996). In this paper we study Plefka's approach which uses 
Taylor series to give a systematic way of building an arbitrarily close approximation to In Z. 
In the subsection we will rederive Plefka's method from a variational perspective, and show 
that SJJ approach is a special case of Plefka's approach. 

3.2.2 Plefka's Method: A Variational Perspective 

In this section we rederive Plefka's method from a variational perspective. The inequality 
on —InZ derived in the process, and the establishment of convexity of the approximation 
function with respect to the variational variables are new contributions made in this paper. 
Furthermore, we also demonstrate that the SJJ approach is a special case of Plefka's method. 

As in Section 3.3 let 7 be a real parameter that takes values from to 1. Let us define 
a 7 dependent partition and distribution function, 

Zl = J2e- lE{ * )/T (28) 
s 

e -lE(S)/T 

Pi = y (29) 

Note that Z\= Z and p\ = p. Introducing an external real vector 6, let us rewrite (28) as 

Z, = £ ° g t -S,»*Z (30) 

s 

where Z is given by (6). Using Jensen's Inequality, (e~ x ) > e~( x \ we get 

Z„ f = Zj2Pi e ~^< 6 ' S ' > Ze-^ e > u > (31) 
s 

Taking logarithms on both sides of (31), we obtain 

logZ^> log Z-^OiUi (32) 

i 

Note that the right hand side function is nothing but —C(6,j) as defined in (9). It is also 
worth noting that the motivation for defining the C function comes much more naturally 
here. 
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If the invertibility assumption holds then we can use u as the independent vector (with 
9 dependent on u) and rewrite (32) as 

-lnZy < G(u,j) (33) 

where G is as defined in (11). This gives a variational feel to Plefka's method: treat u as 
external variable vector and choose it to minimize G. It can be further proved that G is a 
convex function in u. To see this define the Hessian of G as H,-,- = « — « — . Differentiation 

L J OU l OUj 

of (10) with respect to u, yields 

I = BH (34) 

where 

BlJ = [(SiSj) - (SiXSj)] (35) 

Being a covariance matrix, B is positive semidefinite. Equation (34) ensures that both B 
and H are non-singular. Hence H is positive definite implying that G is convex. 

As discussed in Section 3.3, the difficulty in inverting (10) for 7^0 remains and is 
solved by the Taylor series approach as discussed in that section. If we use the first order 
approximation the SJJ approach is obtained. For M = 1, we have 



7 

lti^m, 7; = i^yu, v) -t- 

where 



G 1 (u, 1 ) = G(u,0) + ^(E) iio (36) 



N 



G(u, 0) = ^2(v,i In Ui + (1 - Ui) ln(l - u t )). 



8 = 1 

In fact G\ overestimates the G function 



-logZ 7 < G(u,j) < Gi(ti,7). (37) 

To see this (also see (Bhattacharyya & Keerthi, 1999a)) note that equation (33) can also 
be interpreted in terms of the divergence between the two distributions, p 1 and p 7 ,that is 



D(p 1 ,p 1 ) = ^p 7 ln ^ = In Z 1 + G(u, 7) (38) 

Also noting that 



s ' 



L>(p ,P 7 ) = ^p ln — = In Z 7 + Gi(u, 7) (39) 

s Pl 

D(po,pJ = J2p ln^ = G^u, 7) - G(u,j) (40) 

s Pl 

and using the fact that the divergence D is always non-negative (37) is obtained. 
At 7 = 1 

G 1 (u,l) = L s (u) (41) 

where L s is the objective function obtained by SJJ approach; see (24). Note that at 7 = 1, 
(37) establishes the inequality (25) in the SJJ approach. It is thus clearly established that 
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SJJ approach is a first order approximation to Plefka's approach. Incidentally Barber and 
van de Laar (1999) also rederived the SJJ approach by using a cumulant expansion. 

Using (38), (39), (40), an alternate information geometric derivation of Plefka's method 
can be constructed. The variational derivation presented here also helps in establishing links 
with other refinements like TAP and linear response correction. For a detailed discussion 
of this points refer to (Bhattacharyya & Keerthi, 2000, 1999b). 

3.3 Mean-field Approximations for BNs 

In this section, mean-field theory for BNs is developed using Plefka's method discussed 
in the previous sections. For BNs each Si is influenced by X)}=i w ijSj + hji which can 
be viewed as fields. The mean-field approximation then suggests that these probabilistic 
fields may be replaced by their mean values, that is Y^j=i w ij u j + hi- Keeping this in mind, 
Plefka's method is adapted to develop mean-field schemes for BNs. In Section 3.2 we 
suggest an approximation method which can be used to compute all the quantities required 
for implementing Plefka's approach. Our approach is quite general and does not depend 
on the form of activation function. For other activation independent approaches regarding 
the application of mean-field theory to BNs see (Haft, Hofmann, & Tresp, 1999; Kearns & 
Saul, 1998). Since for belief network operation T is set to 1, we will drop T from all further 
equations. 

3.3.1 A New Scheme Based on Plefka's Method 

Plefka's method, as presented in Section 4, is not directly useful for BNs, because of the 
intractability of the partial derivatives at j = 0. To overcome this problem, we suggest 
a method based on Taylor series expansion. This is a very general method and is not 
dependent on the activation function. This method enables calculation of all the necessary 
terms required for extending Plefka's method for BNs. 
Let us define a new energy function 



N 



E(P, S, u,w) = - In f(M0)) + (1 - Si) ln(l - f(M0))} 



(42) 



where < (3 < 1 



i-1 



Uj) + Mi 



3 = 1 



and 



i-1 



WijUj + hi. 



3 = 1 



Since (3 is the important parameter, E((3, S, u, w) will be referred to as E((i) so as to avoid 
notational clumsiness. Note that 



N 
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and E(l) = E. We use a Taylor series approximation of E((i) with respect to (3. Let us 
define 

C pkQkfi 



Ec(P) = E(0)+J2 



(3=0 



k=l H 

If Ec approximates E, then we can write 

Let us now define the following function 

A( 7 ,M = -ln£ e -^+£^+£ ( 



(43) 



(44) 



(45) 



The 9i are assumed to be functions of u, (3,j, which are obtained by inverting the following 
equations: 

Uk = J2 S kP 1 p VA; (46) 
s 

where 

By replacing E by Ec in (45) we obtain Ac 

A c ( 7 , /3, u) = - In £ e-^+Ei + £ < 



(48) 



where the definition of u is obtained by replacing E by E'c*. In view of (44) one can consider 
Ac as an approximation to A. This observation suggests an approximation to G. 



G(j, u) = A(j, 1, u) « A c (j, 1, «) 
Then the mean-field equations can be stated as 



(49) 



dG _ dA 

du-i du; 



dA 



c 



(3=1 



di 







(3=1 



(50) 



The required terms needed in the Taylor expansion of G in j can be approximated by 

G(u, 0) = A(0, 1, u) = A c (0, 1, u) 



d k G 



dj k 



d k A 



7=0 



dj k 



d k A c 



7 =0,/?=l 



7 =0,/?=l 

The biggest advantage in working with Ac rather than G is that the partial derivatives of 
Ac with respect to j at j = and (3 = 1 can be expressed as functions of u. We define 



M k ak A 

Gmc(u, 1 ) = A c (u,0) + J2 1 ° 



k\ dj k 



(51) 



7 =0,/?=l 
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In light of the above discussion one can consider 



Gm ~ Gmc 

and hence the mean-field equations can be stated as 

dG OGm 9Gmc 



di 



di 



di 







(52) 



(53) 



To develop a feel for the approximations let us consider the case where M = 1, the first 
order case. Recall that at j = 0, the distribution is given by 



Po 



n 



1 



dA 



c 



7 =0,/3=l 



(^(% o = (i?(0))~+ErT 



1 / d k E 



k=l 



kl \ d(i k 



(54) 



13=0/ 



Po 



All the terms can be computed as a function of u. The appropriate expressions are derived 
in appendix C. All the above terms can be used in computing an approximation to G\. 
Using (51) we obtain 



c 



G ic (u, 7) = G(u, 0) + 7 (E(0))~ + ^ ^ 



1 / d k E 



^ kl \ dfi k 



p=ol 



Po ■ 



Thus using (52) we have 

Gi(t)«Gic(t) 
The mean-field equations are obtained by using (53) to get 



dG t 



c 



di 



The fixed point equations thus obtained are 



c 







1 / d k E 



k=i 



kl \ d(i k 



(3=0 




(55) 



(56) 



(57) 



(58) 



Note that the scheme resulting from C=l approach can also be obtained from a saddle 
point approach, (Bhattacharyya & Keerthi, 1999a). A confusion might arise regarding 
the interpretation of Gm The variational derivation might tempt one to believe that since 
G is an upper bound to — In Z, and Gm is an approximation to G, then Gm is also an 
upper bound. Such arguments are not correct. It is difficult to establish whether Gm can 
be considered as an upper or lower bound for a given M and E. In fact even for M = 2, 
one can at most say Gi < G\, (Bhattacharyya & Keerthi, 1999b), for a general E. 

We interpret Gm as a close approximation to G as a function of u. If the approximations 



are close enough then the gradients g^- must also be well approximated by 
ultimately leads to (18). A similar interpretation applies to Gmc- 



which 
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Application of Plefka's method to any stochastic system, i.e. for any E, relies on the 
validity of the invertibility assumption which depends on the convergence of the Taylor 
series expansion of G in 7. More importantly the radius of convergence p has to satisfy 
p > 1 (Plefka, 1982). It is still an open question whether the radius of convergence for the 
energy function described in (4) or in (43) satisfies this condition. 



4. Examples and Numerical Experiments 

In this section we apply the various approximation schemes developed in the previous section 
to two different class of networks, namely the sigmoid network defined by the sigmoid 
activation function is a(x) = 1+ ],- x , and the noisy-or network defined by the activation 
function p(x) = 1 — e~ x , x > 0. In this paper we will restrict ourselves to three different 
approximation schemes having the objective functions Gn, G12, ^22- It is interesting to 
compare the approximations for sigmoidal BNs as derived in (Saul et al., 1996) with those 
obtained here. Of particular interest is G\2- The fixed point equations resulting from this 
objective function are 



Ui 



N 
k=i+l 



a(M k ))w kt + K ki } 



(59) 



K kt = -0.5a(M k )(l- a(M k )) 



k-i 

2a(M k ))w kt w kl u l{ 1 
1=1 



u l ) + w 2 kl {l-2u l )\ (60) 



If k < i then K k { = 0. Approximation for sigmoidal BNs as derived in (Saul et al., 1996) 
has the following fixed point equation 



Ui 



N 
k=i+l 



^k)w kt + Kki] 



(61) 



where again Kki = , if k < i; just like (60), K k { is also dependent on m, I = 1, • • • , k — 1. 

The expressions for K k { (refer to Saul et al., 1996 for exact expressions) look very 
different from K k {. It may be still possible that numerically they may be very close. In fact 
the similarity in structure in (59) and (61) is worth noting, and experimental results show 
that as far as approximation of In Z goes they yield close results. It is possible that the t, k 
and a(M k ) play the same role. One can refer to (Saul & Jordan, 1999) for a discussion on 
this point. It is a matter of future study to investigate the above relationships in detail. 



4.1 Experimental Results 

To test the approximation schemes developed in Section 3, numerical experiments were 
conducted. Small Networks were chosen so that In Z can be computed by exact enumeration. 
For all the experiments the network topology was fixed to 2 X 4 X 6; see figure 1. This 
choice of the network enables us to compare the results with those of Saul et al.(1996). To 
compare the performance of our methods with their method we repeated the experiment 
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Figure 1: Three layer BN (2x4x6) with top down propagation of beliefs. The activation 
function was chosen to be one of sigmoid and noisy-or. 



conducted by them for sigmoid BNs. 10, 000 networks were generated by randomly choosing 
weight values in [—1,1]. The bottom layer units, or the visible units of each network was 
instantiated to zero. The likelihood, In was computed by exact enumeration of all the 
states in the higher two layers. The approximate value of — In Z was computed by Gmc\ 
u is computed by solving the fixed point equations obtained from (53). The goodness of 
approximation scheme was tested by the following measure 

We implemented 3 approximation schemes Gn, G12, <J22- Relevant expressions are worked 
out in Appendix B. For a proper comparison we also re-implemented the SJJ method. The 
goodness of approximation for the SJJ scheme is evaluated by substituting Gmc, m (62) by 
L S approx, mentioned in Section 3.2.1, for specific formula see (Saul et al., 1996). The results 
are presented in the form of histograms in Figure 2 and Figure 3. We also repeated the 
experiment with weights and biases taking values between -5 and 5, the results are again 
presented in the form of histograms in Figure 4 and Figure 5. The findings are summarized 
in the form of means tabulated in Table 1. 

For small weights G12 and the SJJ approach show close results, which was expected. 
But the improvement achieved by the G22 scheme is remarkable, it gave a mean value of 
0.0029 which compares substantially well against the mean value of 0.01139 reported by 
Bishop et al.. They suggest the use of mixture distributions which requires introduction 
of extra variational variables; more than 100 extra variational variables are needed for a 5 
component mixture. This results in substantial increase in the computation costs. On the 
other hand the extra computational cost for G22 over G12 is marginal. This makes the G22 
scheme computationally attractive over the mixture distribution. 

To study the robustness of the schemes the weight scales were increased. As the weight 
scales were increased degradation was noticed in all the four methods. The point to be 
noted is all the three methods appear to be more robust than the SJJ approach. But unlike 
the small weights case G22 did not perform well, it is a poor approximation for large weights. 
The best results are obtained by G12 scheme. 

During the course of experimentation it was found that, for some networks with large 
weights, the fixed point equations converged to an order 2 fixed point; while solving the 
fixed point equations u oscillates between two vectors u\ and u^- To solve this problem we 
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small weights [—1, 1] 


large weights [—5,5] 


G 11 


-0.0404 


-0.0440 


G\2 


0.0155 


0.0231 


G22 


0.0029 


-0.0456 


SJJ 


0.0157 


0.0962 



Table 1: Mean of £ for randomly generated sigmoid networks. 10,000 networks were ran- 
domly selected by choosing the weights from the range [—1, 1]. The experiment 
was repeated by choosing the weights from a larger range [—5,5] 




Figure 2: Histograms for G\c and SJJ scheme for small weights, taking values in [—1, 1], for 
sigmoid networks. The plot on the left show histograms for £ for the schemes G\\ 
and G\2- They did not have any overlaps and they clearly show the improvement. 
G11, gives a mean of -0.040 while G12 gives a mean of 0.0155. The plot on the 
right shows the histogram for the SJJ scheme. The mean is given by 0.0157. 



adapted the following strategy. As soon as the oscillations were detected we stopped the 
fixed point equations, and restarted it with a new point «g. This new point was chosen by 
searching along the line between u\ and which gave a minimum value of £. Once this 
was done convergence to an order 1 fixed point occurred. 

Numerical experiments were also conducted for noisy-or networks. For the approxima- 
tion schemes to work well it should be able to approximate In Z over a range of values. This 
motivated the experiment described below. Noisy-or networks, whose topology is given by 
Figure 1 were randomly generated by choosing weights and biases randomly from and 
0.25. For each network In Z was computed for all the bottom layer states. Out of all the 
states, two states were chosen such that In Z is maximized and minimized respectively. The 
bottom layer was instantiated with each of the chosen two states, and approximations to 
the likelihood were then computed. This experiment was repeated for 10, 000 such net- 
works. Again £ is used as a measure of goodness of approximation. Figures 6 and 7 show 
the corresponding histograms. We repeated the experiment for a different weight range, 
[0.2,0.8]. Figures 8 and 9 show the relevant histograms. The histograms are summarized 
by mean values of £ tabulated in table 2. In this case it appears that G22 is indeed a poor 
approximation. G12 almost always gave the best results showing that even for noisy-or ac- 
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-2 2 4 6 8 10 12 14 16 

x 1[T 3 

Figure 3: Histogram of £ for G22 applied to sigmoid network with small weights. The mean 
obtained is 0.0029. Note that £ is scaled by 10 3 in this figure 




Figure 4: Histograms for the G\c an d SJJ schemes for large weights, taking values in [—5, 5] 



for sigmoid networks. The histogram on the left shows £ for G\\ scheme having 
a mean of —0.0440, one at the center is for G12 scheme having a mean of 0.0231, 
and one at the right is for SJJ scheme, having a mean of 0.0962. 





Small weights 
[0., 0.25] 


large weights 
[0.2,0.8] 


In Z max case 


In Z m i n case 


In Z max case 


In Z m i n case 


G11 


0.001 


-0.061 


-0.156 


-0.029 




0.028 


0.011 


0.052 


0.015 


G22 


0.445 


0.320 


0.090 


0.211 



Table 2: Mean of £ for randomly generated noisy-or networks. 10, 000 networks were ran- 
domly selected by choosing the weights from the range [0., 0.25]. For each network 
the visible states with maximum In Z and minimum In Z were identified. The 
visible nodes of each network is then instantiated with the identified states, and 
the corresponding log Z is approximated by various schemes. The experiment was 
repeated by choosing the weights from a different range [0.2, 0.8] 
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Figure 5: The histogram shows £ for G22, having a mean of —0.0456. The network is 
sigmoid with large weights 




Figure 6: Histograms for visible state with maximum logZ, for noisy-or networks with 
weights taking values in [0,0.25]. The histogram on the left shows £ for G\\ 
scheme, at the center G12 scheme and that at the right Gii- The scheme G\\ 
gave a mean of 0.001, G\i gave a mean of 0.028 and G22 gave a mean of 0.445 




Figure 7: Histograms for visible state with minimum log 2", for noisy-or networks with 
weights taking values in [0,0.25]. The histogram on the left shows £ for G\\ 
scheme, at the center G\% scheme and that at the right Gii- The scheme G\\ 
gave a mean of -0.061, G\i gave a mean of 0.011 and G22 gave a mean of 0.320 
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Figure 8: Histograms for visible state with maximum logZ, for noisy-or networks with 
weights taking values in [0.2,0.8]. The histogram on the left shows £ for G\\ 
scheme, at the center G12 scheme and that at the right Gii- The scheme G\\ 
gave a mean of -0.156, G\i gave a mean of 0.052 and G22 gave a mean of 0.09 



□EE IE 

Figure 9: Histograms for visible state with minimum logZ, for noisy-or networks with 
weights taking values in [0.2,0.8]. The histogram on the left shows £ for G\\ 
scheme, at the center G12 scheme and that at the right Gii- The scheme G\\ 
gave a mean of -0.029, Gyi gave a mean of 0.015 and G22 gave a mean of 0.212 
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M=1,C=1 M=1,C=2 M=2,C=2 SJJ 



M=1,C=1 M=1,02 M=2,C=1 



Figure 10: The plot on the left shows True log likelihood divided by the number of patterns 
for Gn, G12, G22 and SJJ schemes after training on sigmoid networks; the plot 
on the right shows the true log likelihood divided by the number of patterns for 
noisy-or networks 



tivation function it is the best. Again it should be mentioned that for some of the networks 
the fixed point equations converged to an order 2 fixed point, the heuristic described before 
was used to solve this problem. 

Of the three schemes G12 is the most robust and also yields reasonably accurate results. 
It is outperformed only by G22 in the case of sigmoid networks with low weights. Empirical 
evidence thus suggests that the choice of a scheme is not straightforward and depends on 
the activation function and also parameter values. 

To study the learning capabilities of the various schemes proposed we took up a toy 
problem suggested by Hinton et al.(1995) involving binary images. These binary images 
are of size 4 X 4 in which each image consists of either vertical or horizontal bars with 
equal probability, with each location of the bar occupied with probability of 0.5. We took 
a 1 X 8 X 16 network and tried to learn it using both the sigmoid and noisy-or activation 
functions. Number of patterns used was 2000, while the number of epochs was 500. The 
experiment was repeated for 10 different networks. For each network true likelihood was 
computed by exact enumeration. The SJJ method yielded lower likelihoods in almost all the 
cases. It thus appears that the three proposed schemes have a better learning performance 
than the SJJ approach. The results are summarized in figure 10. 



5. Discussion 

In this section we summarize the contributions of this paper, and identify issues for future 
research. The main contributions of this paper are presented in Section 3. In Section 3 
Plefka's method is introduced, re-derived from an variational perspective and applied to 
BNs. Plefka's approach gives a systematic way of building an arbitrarily close approxi- 
mation to — In Z. However it should be noted that the effort needed to evaluate higher 
order terms increases with the order and might be even exponential for an arbitrarily close 
approximation. 
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The variational derivation establishes that SJJ approach is a special case of Plefka's 
approach, thus serving as a link with the existing theory. This derivation process does 
not make any assumptions regarding the structure of energy function and hence it is also 
applicable to BNs. The validity of Plefka's method is subject to the condition, that radius 
of convergence should be greater than 1; see Section 3. It is still an open question whether 
one can prove that such condition holds for the BN energy function. 

Application of Plefka's theory to BNs is not straightforward, it requires computation 
of some averages which are not tractable. We presented a scheme in which the BN energy 
function is approximated by a Taylor series, which gives a tractable approximation to the 
terms required for Plefka's method. Various approximation schemes depending on the 
degree of the Taylor series expansion are derived. Unlike the approach in (Saul et al., 
1996), the schemes discussed here are simpler as they do not introduce extra variational 
variables; compare equations (59) and (61). Another positive aspect of these approximations 
is that they are general and not activity function dependent; hence they are applicable to a 
broad class of BNs. Of course for a given activation function it might be possible to come 
up with tailor-made approximations which are better than the schemes discussed here. But 
empirical evaluation on small scale networks show that the quality of approximations is 
better than those obtained from other variational methods. 

The computational simplicity, robustness and generality make these schemes very attrac- 
tive. Unfortunately theoretical guarantees regarding the validity of Taylor series expansion 
in (3 are missing. This is another open issue which needs to be addressed in the near future. 
It would be interesting to see how these schemes perform on real world datasets. We are 
presently exploring the applicability of these methods on the hand-written digits data-base. 
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Appendix A: Expressions for 1st Order Approximation for BNs 

In this section we present expressions for the "first order approximation" method introduced 
in Section 3. To compute (X-f), the average being taken over the factorial distribution 

Tlf=i M f J (1 — u j) 1 ~ Sj > we define a random variable Ri = e xx ' where X{ = X)j=i w ij{'^j ~ u j)- 
Now 

N 

(R t ) = J2 eXX 'Ii u f( 1 - u j) 1 ~ S > (63) 

S j=l 



d k (R l 



A=0 



Ptf> (64) 



i-l 

(Ri) = "[[{e^^-^Uj + e~ Xwi M (1 - Uj )} (65) 

3 = 1 

(X-) = (66) 
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8-1 
3 = 1 

Differentiation of E in (42) and taking average with the factorial distribution yields 



(67) 



d k E 



d(i k 



N 



Mi){Xl 



13=0/ 



8=1 



where 



g kt (u tl Mi) = ^'^fcln f(Mi) + (1 - ti,-)^^ln(l - /(!;)) 

8M; 8M; 



(68) 



(69) 



The first term gn term is irrelevant beacause of (66). In our implementations g^i term is 
used, its expression is 



g 2 i{u ll Mi) = {- 



Ui 



/(M 8 ) 2 (1 - /(M 8 )) 2 
Thus by (66) we have 



1 Ui }f(M i ) 2 + { Ui 



1 - u; 



f(M t ) 1 - f(Mi) 



}/"(M,-) (70) 



(71) 



(3=0/ 



N 



Gic = ^2[u t ln Ui + (1 - ln(l - u t )] 



8 = 1 



N 



$>,-ln f(Mi) + (1 - u t ) ln(l - /(M,-)] 



8 = 1 

C N 



fc=l 8=1 



(72) 



Fixed point equations are obtained by putting d g^ c = 0, and also noting the point that 
(Xi) = 0. 



Ui = a 



wTO + MwM f {Ml) + S ^ S Mi)<Xi } 



(73) 



Appendix B: Mean-field Expressions for NOISY-OR and SIGMOID BNs 

In this section we present formulae for GmcJ ? C = 1,2, M = 1,2 that was used for the 
experiments described in Section 4. 

N N 

Gn = ^(m In ^ + (1 - u t ) ln(l - u t )) - In /(!,-) + (1 - ln(l - /(!;)) (74) 

8=1 8=1 
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N N 

8=1 8=1 

/(M,-)2 (1 - f(Mi)) 2 J 1 ^ 



+ TtSfT - ^TT^T ) f'Wi) \ (X?) (75) 



1 ~ Uj 

\f(Mi) l-f(Mi) 



where (Xf) = ^(1 - u t ). 

It is a matter of detail to plug in the appropriate functions for a specific network, like 
the sigmoid or noisy-or. Since it is too cumbersome we do not present explicitly the terms 
required for G22, for a general activation function. Instead we present the expressions used 
for the sigmoid and noisy-or functions. 

For Sigmoid network 



G22 = G 12 -0.5j2(J2™^-^)(Mi{u t -a{M l )) + w u u t {l 



N /8-1 
8 = 1 \/=l 

H+V t-1 

+ E J2 w 3kW t kUk{l - u k )(u t - a(M t )(u J - a(Mj)) 

j = l k=l 

H+V H+V \ 

"E Y {Mui-viM^ + M^iuj-aiM^w^uUl-u^] (76) 

j = i+l / = 8 + l / 



For noisy-or network 



N i-1 / t-. \ 

Ui(l - Ui Ui 



G 22 = G12 -0.5^(^ wfjUj (1 - Uj) i-^=^- + jjj=j ~ 1) M 1 + e 



M 



8=1 \j=l 

H+V t-1 

H+V I H+V _ \ 

-EE w k u-^-i) + (-^--i)iog(i + e M ')) W]t u t (i-um 
j=i+i \k=i+i jy Mk > jy M i> ) 

In both the expressions (76) and (77) t is chosen to be either i or j, whichever is lower; 
also the activation function, f, in (77) is f(x) = 1 — e~ x . 

The vector u is determined by solving fixed point equations obtained by setting 

9Gmc _ q 
dui 

The gradients required for learning is evaluated by d Q^f c at the fixed point. 
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